fix: lead responsiveness (error suppression, circuit breaker, routing, auth) by btucker · Pull Request #2328 · btucker/midtown

btucker · 2026-04-04T12:51:55Z

Summary

Six fixes for channel lead responsiveness, found via dogfood testing:

1. Suppress error output from auto-posting to channels

Session error text (e.g. expired OAuth 401) was auto-posted as normal output

2. Add circuit breaker for lead spawn failures

Leads had no max retry limit — crash-looped forever
MAX_LEAD_RESTARTS=3, escalates to ops

3. Topic channel @ALL now finds workers in dm channels

Workers in dm-{name} were invisible to channel-scoped @all

4+5. Nudge resume/respawn respects spawn failure cooldowns

Both ResumeAndDeliver and RespawnAndDeliver now check cooldowns

6. Replace broken OAuth code flow with direct `claude auth login`

Old flow (BROWSER=false + paste code) didn't work
New flow: Login button runs claude auth login directly, opens native browser
Removed: /api/auth/login/code, AuthLoginProcess, pending_auth_login

Spec updates: v2-spec.md §1.4, §4.1, §4.4

Test plan

flush_auto_output_suppresses_error_results / _posts_normal_results
lead_gives_up_after_max_spawn_failures
at_all_in_topic_finds_workers_in_dm_channels
stopped_lead_with_active_cooldown_resolves_to_drop
stopped_worker_with_active_cooldown_resolves_to_drop
stopped_lead_with_session_id_and_active_cooldown_resolves_to_drop
Full test suite: 1534 lib tests, 0 failures
cargo clippy + cargo fmt clean

🤖 Generated with Claude Code

When a Claude session exits with an error (e.g. expired OAuth token), the error text was being auto-posted to the channel as if it were normal assistant output. This caused raw JSON API errors to appear in user-facing channels. Now, when flush_auto_output detects a Result event with is_error: true, it clears the pending events instead of posting them. Errors are still logged to daemon.log for debugging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-04-04T12:54:26Z

Codecov Report

❌ Patch coverage is 66.66667% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.93%. Comparing base (bc6b661) to head (a7d5ede).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/daemon_v2/web/routes.rs	0.00%	17 Missing ⚠️
src/daemon_v2/executor/mod.rs	84.37%	5 Missing ⚠️
src/daemon_v2/daemon.rs	55.55%	4 Missing ⚠️
src/daemon_v2/decisions/health.rs	88.23%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2328      +/-   ##
==========================================
- Coverage   64.00%   63.93%   -0.07%     
==========================================
  Files         100      100              
  Lines       38588    38578      -10     
==========================================
- Hits        24699    24666      -33     
- Misses      13889    13912      +23

Files with missing lines	Coverage Δ
src/daemon_v2/decisions/chat.rs	`96.56% <100.00%> (+0.13%)`	⬆️
src/daemon_v2/web/mod.rs	`100.00% <ø> (ø)`
src/daemon_v2/decisions/health.rs	`88.78% <88.23%> (+0.48%)`	⬆️
src/daemon_v2/daemon.rs	`66.53% <55.55%> (-4.22%)`	⬇️
src/daemon_v2/executor/mod.rs	`62.92% <84.37%> (-9.21%)`	⬇️
src/daemon_v2/web/routes.rs	`3.17% <0.00%> (+0.20%)`	⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7292e8777a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-04T12:55:09Z

+    let has_error_result = events.iter().any(|e| {
+        matches!(
+            e,
+            crate::headless::StreamEvent::Result { is_error: true, .. }
+        )


Defer error suppression until turn end

flush_auto_output suppresses only when the current events batch already contains StreamEvent::Result { is_error: true }, but drain_session_output also flushes on the 2-second timer before the result event arrives. In error cases where an assistant error message is emitted first and the timer fires before the Result, this branch is bypassed and the raw error text is posted; the later error result then clears an empty buffer. That means the new “suppress error output” behavior is still violated under this timing.

Useful? React with 👍 / 👎.

Channel leads had no max retry limit — when a lead's auth expired, it would crash, wait 120s, crash again, and repeat forever. Workers already had MAX_WORKER_RESTARTS=3 with ops escalation, but leads were missing this protection. Now ensure_lead_for_channel checks failure_count and stops retrying after MAX_LEAD_RESTARTS=3 consecutive failures, posting to ops instead. The daemon also uses the correct per-kind max and escalation key format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Workers are spawned with channel=dm-{name}, not the task's channel. This meant @ALL in a topic channel only found agents indexed under that channel (the lead), missing workers whose tasks belong there. The fix adds a second scan after the by_channel lookup that finds workers by checking task.channel instead of agent.channel. The nudge dedup set prevents double-nudging workers that happen to be indexed in the correct channel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

resolve_nudge_action returned RespawnAndDeliver for any stopped agent with no session_id, regardless of whether a SpawnFailure cooldown was active. This meant every user message to a dead lead triggered a new spawn attempt, bypassing both the 120s cooldown and the circuit breaker (MAX_LEAD_RESTARTS) from ensure_channel_leads_alive. Now resolve_nudge_action checks is_active(SpawnFailure, key) before returning RespawnAndDeliver. Leads use channel as key, workers use task_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The Cycle 4 fix only added cooldown checks to RespawnAndDeliver (no session_id). ResumeAndDeliver (has session_id) was unchecked, meaning agents with preserved session_ids (e.g. auth errors that don't trigger "No conversation found") could be resumed on every user message. Moved the cooldown check before the session_id branch so it applies to both resume and respawn paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The web UI's Re-authenticate button triggered an OAuth code flow (BROWSER=false, capture URL, paste code) that didn't work in practice. The only reliable auth method is running `claude auth login` directly, which opens the browser natively for OAuth. Changes: - Backend: simplified auth_login route to run `claude auth login` without BROWSER=false, waits up to 5 min for completion, then restarts all agents - Removed: /api/auth/login/code endpoint, AuthLoginProcess struct, pending_auth_login state - UI: replaced code-paste flow with simple Login button that shows "Logging in" state while waiting for browser auth to complete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses Codex review: flush_auto_output's error check only worked when the Result event was in the same batch as the error text. If the 2-second timer fired between the error text and the Result, the text leaked through. Moved error suppression from flush_auto_output to the drain loop: - Set session_errored flag when Result { is_error: true } is received - Skip timer flushes, stream-end flushes, and turn-end flushes when the flag is set - flush_auto_output is now a pure extract-and-post function Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

btucker · 2026-04-04T14:38:45Z

Addressed Codex review in bf69bb3:

P1 — Defer error suppression until turn end: Moved the error check from flush_auto_output to a session_errored flag in the drain loop. The flag is set when Result { is_error: true } arrives, and all subsequent flushes (timer, stream-end, turn-end) are suppressed. This closes the race where the 2-second timer could fire between the error text and the Result event.

Pre-existing biome error exposed by touching Channel.svelte. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pre-existing biome noNonNullAssertion error in store.test.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Biome noNonNullAssertion errors — replace ! with optional chaining or nullish coalescing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Downgrade noNonNullAssertion to warn for .svelte files (reactive code uses $store! assertions extensively and they are valid at runtime) - Apply biome auto-fix for Channel.svelte optional chaining - Apply biome format fix for api.ts startAuthLogin signature Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AccountPanel.svelte also had the old startAuthLogin + submitAuthCode flow. Updated to use the simplified direct-login approach (matching Channel.svelte changes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Apr 4, 2026

View reviewed changes

btucker changed the title ~~fix: suppress error output from being auto-posted to channels~~ fix: suppress error auto-output + add lead spawn circuit breaker Apr 4, 2026

btucker changed the title ~~fix: suppress error auto-output + add lead spawn circuit breaker~~ fix: error suppression, lead circuit breaker, and topic @all routing Apr 4, 2026

btucker changed the title ~~fix: error suppression, lead circuit breaker, and topic @all routing~~ fix: error suppression, lead circuit breaker, @all routing, nudge cooldowns Apr 4, 2026

btucker and others added 2 commits April 4, 2026 08:24

btucker changed the title ~~fix: error suppression, lead circuit breaker, @all routing, nudge cooldowns~~ fix: lead responsiveness (error suppression, circuit breaker, routing, auth) Apr 4, 2026

btucker and others added 5 commits April 4, 2026 09:40

fix(web): use optional chaining for cmd.execute call

e47b999

Pre-existing biome error exposed by touching Channel.svelte. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(web): replace non-null assertion with nullish coalescing in test

dcd31ff

Pre-existing biome noNonNullAssertion error in store.test.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(web): remove all non-null assertions from test files

0893229

Biome noNonNullAssertion errors — replace ! with optional chaining or nullish coalescing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

btucker merged commit b47707c into main Apr 4, 2026
13 checks passed

btucker deleted the blindspots/dogfood/20260404-073834 branch April 4, 2026 15:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: lead responsiveness (error suppression, circuit breaker, routing, auth)#2328

fix: lead responsiveness (error suppression, circuit breaker, routing, auth)#2328
btucker merged 12 commits intomainfrom
blindspots/dogfood/20260404-073834

btucker commented Apr 4, 2026 •

edited

Loading

Uh oh!

codecov bot commented Apr 4, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 4, 2026

Uh oh!

btucker commented Apr 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

btucker commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Suppress error output from auto-posting to channels

2. Add circuit breaker for lead spawn failures

3. Topic channel @ALL now finds workers in dm channels

4+5. Nudge resume/respawn respects spawn failure cooldowns

6. Replace broken OAuth code flow with direct claude auth login

Test plan

Uh oh!

codecov bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

btucker commented Apr 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

btucker commented Apr 4, 2026 •

edited

Loading

6. Replace broken OAuth code flow with direct `claude auth login`

codecov bot commented Apr 4, 2026 •

edited

Loading